depth efficiency


Limits to Depth Efficiencies of Self-Attention

Neural Information Processing Systems

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: empirical signals indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). In this paper, we theoretically study the interplay between depth and width in self-attention. We shed light on the root of the above phenomenon, and establish two distinct parameter regimes of depth efficiency and inefficiency in self-attention. We invalidate the seemingly plausible hypothesis by which widening is as effective as deepening for self-attention, and show that in fact stacking self-attention layers is so effective that it quickly saturates the capacity of the network width. Specifically, we pinpoint a "depth threshold" that is logarithmic in the network width: for networks of depth below the threshold, we establish a double-exponential depth-efficiency of the self-attention operation, while for depths over the threshold we show that depth-inefficiency kicks in. Our predictions accord with existing empirical ablations, and we further demonstrate the two depth-(in)efficiency regimes experimentally for common network depths of 6, 12, and 24. By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
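A compact way to read the abstract's two regimes (the symbols L for depth and d_x for width are ours, the capacity measure is left abstract, and the exact constants and formal separation-rank statement are in the paper):

$$
L_{\mathrm{th}}(d_x) = \Theta(\log d_x), \qquad
\mathrm{capacity}(L, d_x)\ \text{grows}\
\begin{cases}
\text{double-exponentially in } L, & L < L_{\mathrm{th}}(d_x) \quad \text{(depth-efficiency)},\\[2pt]
\text{only as far as } d_x \text{ allows}, & L > L_{\mathrm{th}}(d_x) \quad \text{(depth-inefficiency)}.
\end{cases}
$$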


Reviews: Deep Homogeneous Mixture Models: Representation, Separation, and Approximation

Neural Information Processing Systems

The paper discusses connections between multiple density models within the unifying framework of homogeneous mixture models; the models covered include tensorial mixture models [1], hidden Markov models, latent tree models, and sum-product networks [2]. The authors argue that there is a hierarchy among these models by showing that a model lower in the hierarchy can be cast into a model higher in the hierarchy using linear-size transformations. Furthermore, the paper gives new theoretical insights into depth efficiency in these models by relating it to properties of the represented mixture coefficient tensor. Finally, the paper gives positive and somewhat surprising approximation results using [3]. Strengths: connections between various models, which so far were somewhat folk wisdom, are illustrated within a unifying tensor mixture framework.
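As a rough sketch of the common form underlying these models (notation ours, not the reviewed paper's): a homogeneous mixture over N variables combines shared per-variable component densities through a single mixture coefficient tensor, and it is the properties of that tensor that the depth-efficiency argument hinges on:

$$
p(x_1, \dots, x_N) \;=\; \sum_{d_1=1}^{k} \cdots \sum_{d_N=1}^{k} \mathcal{A}_{d_1, \dots, d_N} \prod_{i=1}^{N} f_{d_i}(x_i),
$$

where f_1, ..., f_k are component densities shared across variables and the tensor A is non-negative and sums to one. The different architectures in the hierarchy (hidden Markov models, latent tree models, sum-product networks) correspond to different structured factorizations of A.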